Using String Kernels for Classification of Slovenian Web Documents

نویسندگان

  • Blaz Fortuna
  • Dunja Mladenic
چکیده

In this paper we present an approach for classifying web pages obtained from the Slovenian Internet directory where the web sites covering different topics are organized into a topic ontology. We tested two different methods for representing text documents, both in combination with the linear SVM classification algorithm. The first representation that we have used is a standard bag-of-words approach with TFIDF weights and cosine distance used as similarity measure. We compared this to String kernels where text documents are compared not by words but by substrings. This removes the need for stemming or lemmatisation which can be important issue when documents are in languages other than English and tools for stemming or lemmatisation are unavailable or are expensive to make or learn. In highly inflected natural languages, such as Slovene language, the same word can have many different forms, thus String kernels have an advantage here over the bagof-words. In this paper we show that on classification of documents written in highly inflected natural language the situation is opposite and String Kernels significantly outperform the standard bag-of-words representation. Our experiments also show that the advantage of String kernels is more evident for domains with unbalanced class distribution.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Besides for its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have overtaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...

متن کامل

String Kernels

This paper provides an overview of string kernels. String kernels compare text documents by the substrings they contain. Because of high computational complexity, methods for approximating string kernels are shown. Several extensions for string kernels are also presented. Finally string kernels are compared to BOW.

متن کامل

Syllables and other String Kernel Extensions

During the last years, the use of string kernels that compare documents has been shown to achieve good results on text classification problems. In this paper we introduce the application of the string kernel in conjunction with syllables. Using syllables shortens the representation of documents compared to a character based representation and as a result reduces computation time. Moreover sylla...

متن کامل

Fast Kernels for Inexact String Matching

We introduce several new families of string kernels designed in particular for use with support vector machines (SVMs) for classification of protein sequence data. These kernels – restricted gappy kernels, substitution kernels, and wildcard kernels – are based on feature spaces indexed by k-length subsequences from the string alphabet Σ (or the alphabet augmented by a wildcard character), and h...

متن کامل

Generalized Similarity Kernels for Efficient Sequence Classification

String kernel-based machine learning methods have yielded great success in practical tasks of structured/sequential data analysis. In this paper we propose a novel computational framework that uses general similarity metrics and distance-preserving embeddings with string kernels to improve sequence classification. An embedding step, a distance-preserving bitstring mapping, is used to effectivel...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2005